Workflow models for heterogeneous distributed systems
The role of data in modern scientific workflows is becoming increasingly crucial. The unprecedented amount of data available in the digital era, combined with recent advancements in Machine Learning and High-Performance Computing (HPC), has allowed computers to surpass human performance in a wide range of fields, such as Computer Vision, Natural Language Processing and Bioinformatics. However, a solid data management strategy is essential for key aspects like performance optimisation, privacy preservation and security.
Most modern programming paradigms for Big Data analysis adhere to the principle of data locality: moving computation closer to the data to remove transfer-related overheads and risks. Still, there are scenarios in which it is worthwhile, or even unavoidable, to transfer data between different steps of a complex workflow.
The contribution of this dissertation is twofold. First, it defines a novel methodology for distributed modular applications, allowing topology-aware scheduling and data management while separating business logic, data dependencies, parallel patterns and execution environments. In addition, it introduces computational notebooks as a high-level and user-friendly interface to this new kind of workflow, aiming to flatten the learning curve and foster the adoption of the methodology.
Each of these contributions is accompanied by a full-fledged, Open Source implementation, which has been used for evaluation purposes and allows the interested reader to experience the related methodology first-hand. The validity of the proposed approaches has been demonstrated on five real scientific applications in the domains of Deep Learning, Bioinformatics and Molecular Dynamics Simulation, executed on large-scale hybrid cloud-HPC infrastructures.
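To make this separation of concerns more concrete, here is a minimal, purely illustrative Python sketch; the names and structures are hypothetical, not the dissertation's actual interface. Business logic is written as plain functions, data dependencies are declared on each step, and the binding of steps to execution locations is kept in a separate, topology-aware mapping.

# Illustrative sketch only: names and structures are hypothetical, not the
# dissertation's actual API. It keeps business logic (plain functions),
# data dependencies and execution environments as separate concerns.
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Step:
    name: str
    func: Callable[..., Any]                           # business logic
    inputs: List[str] = field(default_factory=list)    # data dependencies


@dataclass
class Location:
    name: str
    kind: str                                          # e.g. "cloud" or "hpc"


def preprocess() -> str:
    return "clean-data"


def train(data: str) -> str:
    return f"model({data})"


steps = [Step("preprocess", preprocess), Step("train", train, inputs=["preprocess"])]
locations = {"k8s": Location("k8s", "cloud"), "slurm": Location("slurm", "hpc")}
# Topology-aware binding, declared apart from the business logic above.
bindings = {"preprocess": "k8s", "train": "slurm"}

results: Dict[str, Any] = {}
for step in steps:                                     # naive in-order "scheduler"
    loc = locations[bindings[step.name]]
    args = [results[dep] for dep in step.inputs]
    print(f"running {step.name} on {loc.name} ({loc.kind})")
    results[step.name] = step.func(*args)

print(results["train"])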
StreamFlow: cross-breeding cloud with HPC
Workflows are among the most commonly used tools in a variety of execution environments. Many of them target a specific environment; few of them make it possible to execute an entire workflow in different environments, e.g. Kubernetes and batch clusters. We present a novel approach to workflow execution, called StreamFlow, that complements the workflow graph with a declarative description of potentially complex execution environments, and that makes it possible to execute workflows onto multiple sites not sharing a common data space. StreamFlow is then exemplified on a novel bioinformatics pipeline for single-cell transcriptomic data analysis.
Comment: 30 pages - 2020 IEEE Transactions on Emerging Topics in Computing
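As a rough illustration of the underlying idea (a conceptual Python sketch, not StreamFlow's actual configuration format or API), each step is bound to a declaratively described deployment, and the runtime stages data explicitly whenever the producing and consuming deployments do not share a common data space.

# Conceptual sketch, not StreamFlow's real API: each step is bound to a
# declared deployment, and data is staged between deployments that do not
# share a common data space.
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Deployment:
    name: str
    kind: str          # e.g. "kubernetes", "slurm"
    data_space: str    # identifier of the storage this deployment can reach


deployments = {
    "cloud": Deployment("cloud", "kubernetes", data_space="s3-bucket"),
    "hpc": Deployment("hpc", "slurm", data_space="parallel-fs"),
}
# Declarative binding of workflow steps to deployments.
bindings: Dict[str, str] = {"align_reads": "hpc", "plot_results": "cloud"}


def run_step(step: str, input_site: str) -> str:
    target = deployments[bindings[step]]
    source = deployments[input_site]
    if source.data_space != target.data_space:
        # No shared data space: an explicit transfer is required.
        print(f"transferring inputs of '{step}': {source.data_space} -> {target.data_space}")
    print(f"executing '{step}' on {target.name} ({target.kind})")
    return bindings[step]


site = run_step("align_reads", input_site="hpc")
run_step("plot_results", input_site=site)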
Bringing AI pipelines onto cloud-HPC: setting a baseline for accuracy of COVID-19 diagnosis
HPC is an enabling platform for AI. The introduction of AI workloads into the HPC applications basket has non-trivial consequences both for the way AI applications are designed and for the way HPC computing is provided. This is the leitmotif of the convergence between HPC and AI. The formalized definition of AI pipelines is one of the milestones of HPC-AI convergence. If well conducted, it allows, on the one hand, obtaining portable and scalable applications; on the other hand, it is crucial for the reproducibility of scientific pipelines. In this work, we advocate the StreamFlow Workflow Management System as a crucial ingredient to define a parametric pipeline, called the “CLAIRE COVID-19 Universal Pipeline”, which is able to explore the optimization space of methods to classify COVID-19 lung lesions from CT scans, compare them for accuracy, and therefore set a performance baseline. The universal pipeline automates the training of many different Deep Neural Networks (DNNs) with many different hyperparameters. It therefore requires massive computing power, which is found in traditional HPC infrastructure thanks to the portability-by-design of pipelines designed with StreamFlow. Using the universal pipeline, we identified a DNN reaching over 90% accuracy in detecting COVID-19 lesions in CT scans.
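In spirit, the parametric exploration performed by such a universal pipeline resembles the following hedged Python sketch; the architectures, hyperparameters and the train_and_evaluate function are placeholders, not the actual CLAIRE pipeline code.

# Hedged sketch of a parametric training sweep: architectures and
# hyperparameters are explored and compared for accuracy to set a baseline.
# All names and values below are placeholders, not the real pipeline.
from itertools import product
import random

architectures = ["densenet121", "resnet50", "efficientnet_b0"]
learning_rates = [1e-3, 1e-4]
augmentations = [True, False]


def train_and_evaluate(arch: str, lr: float, augment: bool) -> float:
    # Placeholder: a real pipeline would train a DNN on CT scans here and
    # return its validation accuracy.
    random.seed(hash((arch, lr, augment)) % 2**32)
    return round(random.uniform(0.80, 0.95), 3)


results = {
    cfg: train_and_evaluate(*cfg)
    for cfg in product(architectures, learning_rates, augmentations)
}
best_cfg, best_acc = max(results.items(), key=lambda kv: kv[1])
print(f"baseline: {best_cfg} -> accuracy {best_acc}")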
Model-Agnostic Federated Learning
Since its debut in 2016, Federated Learning (FL) has been tied to the inner workings of Deep Neural Networks (DNNs). On the one hand, this allowed its development and widespread use as DNNs proliferated. On the other hand, it neglected all those scenarios in which using DNNs is not possible or advantageous. The fact that most current FL frameworks only allow training DNNs reinforces this problem. To address the lack of FL solutions for non-DNN-based use cases, we propose MAFL (Model-Agnostic Federated Learning). MAFL marries a model-agnostic FL algorithm, AdaBoost.F, with an open industry-grade FL framework: Intel OpenFL. MAFL is the first FL system not tied to any specific type of machine learning model, allowing exploration of FL scenarios beyond DNNs and trees. We test MAFL from multiple points of view, assessing its correctness, flexibility and scaling properties up to 64 nodes. We optimised the base software, achieving a 5.5x speedup on a standard FL scenario. MAFL is compatible with x86-64, ARM-v8, Power and RISC-V.
Comment: Published at the EuroPar'23 conference, Limassol, Cyprus
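To give a flavour of what model-agnostic means here, the following is a heavily simplified, single-process Python sketch of a federated boosting round in the spirit of AdaBoost.F; the toy datasets, the threshold-stump learner and the aggregation logic are placeholders and do not reflect MAFL's actual implementation on top of OpenFL.

# Simplified, single-process sketch of a federated boosting round in the
# spirit of AdaBoost.F. Datasets, the stump learner and the aggregation
# are placeholders, not MAFL's actual implementation on top of OpenFL.
import math
import random

random.seed(0)


def make_client(n: int):
    # 1-D toy data: the label roughly follows a threshold on x, plus noise.
    xs = [random.uniform(0, 1) for _ in range(n)]
    ys = [1 if x > 0.5 or random.random() < 0.1 else -1 for x in xs]
    return xs, ys


clients = [make_client(100) for _ in range(3)]


def train_stump(xs, ys):
    # Weak learner: best threshold on local data (any model type would do).
    best = min(
        (sum(1 for x, y in zip(xs, ys) if (1 if x > t else -1) != y), t)
        for t in (i / 20 for i in range(21))
    )
    threshold = best[1]
    return lambda x: 1 if x > threshold else -1


# 1) Each client trains a weak hypothesis on its own data.
hypotheses = [train_stump(xs, ys) for xs, ys in clients]


# 2) Every hypothesis is evaluated on every client's data; the aggregator
#    keeps the one with the lowest global error and weights it (alpha).
def global_error(h):
    errs = sum(sum(1 for x, y in zip(xs, ys) if h(x) != y) for xs, ys in clients)
    total = sum(len(xs) for xs, _ in clients)
    return errs / total


errors = [global_error(h) for h in hypotheses]
best_idx = min(range(len(errors)), key=errors.__getitem__)
eps = max(errors[best_idx], 1e-9)
alpha = 0.5 * math.log((1 - eps) / eps)
print(f"selected client {best_idx}'s hypothesis, error={eps:.3f}, alpha={alpha:.3f}")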